16. Under the Hood Part 1
It's working, but…
You've now implemented a basic neural network, with a forward pass, backpropagation, and a training loop that produces an updated model.
If backpropagation still feels like a mystery to you, let's walk through a brief example to hopefully give you more intuition on what's occurring at a low level in the network. If you already feel comfortable with backpropagation, feel free to skip this section.
We'll focus on a simplified backward pass that behaves like our Linear class from before.
Defining our variables
To further understand backpropagation, let's take an example input, X:
X = \begin{bmatrix}x_{11} & x_{12}\\x_{21} & x_{22}\end{bmatrix}
- X has 2 data points (2 rows), with 2 features (2 columns).
- x_{11}, x_{12} are the features for data point 1
- x_{21}, x_{22} are the features for data point 2
Our layer will have a weights matrix, W:
W= \begin{bmatrix}w_1\\w_2\end{bmatrix}
- w_1, w_2 are weights used for feature 1 and 2 (i.e. x_{i1}, x_{i2}, where i is a given data point).
Lastly, our layer will have a gradients matrix, G, coming back from the subsequent layer:
G= \begin{bmatrix}g_1\\g_2\end{bmatrix}
- g_1 is the gradient of the cost for data point 1, while g_2 is the gradient of the cost for data point 2.
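To make the shapes concrete, here's a sketch of the three matrices as NumPy arrays. The numeric values are made-up placeholders, just to give the shapes something to hold:

```python
import numpy as np

# Input: 2 data points (rows) x 2 features (columns).
X = np.array([[1.0, 2.0],   # x11, x12
              [3.0, 4.0]])  # x21, x22

# Weights: one per feature, as a column vector.
W = np.array([[0.5],    # w1
              [-0.5]])  # w2

# Gradient from the subsequent layer: one per data point.
G = np.array([[0.1],   # g1
              [0.2]])  # g2

print(X.shape, W.shape, G.shape)  # (2, 2) (2, 1) (2, 1)
```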
We'll look at just one layer's backpropagation from the previous graph.
Forward and backward pass
We'll ignore the bias in the above graph for now.
Note that the forward pass here takes our input, X, and computes its dot product with the weights, W, to produce the value of the Linear layer, l_1. We're actually going to skip over what happens from l_1 onward, so we'll just show this forward pass as getting fed there each time.
x_{11} \to w_1 \to l_1
x_{12} \to w_2 \to l_1
x_{21} \to w_1 \to l_1
x_{22} \to w_2 \to l_1
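The four arrows above are just a matrix product: each row of X is dotted with W to produce one entry of l_1. A minimal sketch, using the same placeholder values as before:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[0.5],
              [-0.5]])

# Forward pass: each data point's features, weighted and summed.
# l1[i] = x_i1 * w1 + x_i2 * w2
l1 = X @ W

print(l1.ravel())  # [-0.5 -0.5]
```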
Going backwards for backpropagation, we'll get the gradient g_i in place of l_1 from before:
x_{11} \gets w_1 \gets g_1
x_{12} \gets w_2 \gets g_1
x_{21} \gets w_1 \gets g_2
x_{22} \gets w_2 \gets g_2
It may be useful to note again here that the weights are tied to features, while the gradients are tied to data points: the first weight applies to the first feature of both data points, while the first gradient applies to the first data point as a whole.
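The backward arrows can be sketched the same way, again with placeholder values. The gradient with respect to each weight accumulates that weight's feature across all data points (scaled by each data point's incoming gradient), while the gradient with respect to each data point spreads its incoming gradient back over the weights:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[0.5],
              [-0.5]])
G = np.array([[0.1],
              [0.2]])

# Gradient w.r.t. the weights: dW[j] = sum over i of x_ij * g_i.
# w1's gradient mixes x11*g1 + x21*g2; w2's mixes x12*g1 + x22*g2.
dW = X.T @ G  # [[0.7], [1.0]]

# Gradient w.r.t. the input: dX[i][j] = g_i * w_j.
# Data point 1's row is g1 spread over (w1, w2), and likewise for g2.
dX = G @ W.T  # [[0.05, -0.05], [0.1, -0.1]]
```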